Setup

knitr::opts_chunk$set(
  echo = TRUE, 
  dev = "png",
  dpi = 150,
  fig.align = "center",
  comment = NA
)
library(rstan)
## Loading required package: ggplot2
## Loading required package: StanHeaders
## rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggplot2)
library(bayesplot)
## This is bayesplot version 1.6.0
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
##    * Does _not_ affect other ggplot2 plots
##    * See ?bayesplot_theme_set for details on theme setting
theme_set(bayesplot::theme_default())

# seed for R's pseudo-RNGs, not Stan's
set.seed(1123) 

The problem

Background

Imagine that you are a statistician or data scientist working as an independent contractor. One of your clients is a company that owns many residential buildings throughout New York City. The property manager explains that they are concerned about the number of cockroach complaints that they receive from their buildings. Previously, the company offered monthly visits from a pest inspector as a solution to this problem. While this is the default solution of many property managers in NYC, the tenants are rarely home when the inspector visits, and so the manager reasons that this is a relatively expensive solution that is currently not very effective.

One alternative is to deploy long-term bait stations. In this alternative, child- and pet-safe bait stations are installed throughout the apartment building. Cockroaches obtain quick-acting poison from these stations and distribute it throughout the colony. The manufacturer of these bait stations provides some indication of the space-to-bait efficacy, but the manager suspects that this guidance was not calculated with NYC roaches in mind. NYC roaches, the manager rationalizes, have more hustle than traditional roaches; and NYC buildings are built differently than other common residential buildings in the US. This is particularly important as the unit cost for each bait station per year is quite high.

The goal

The manager wishes to employ your services to help them to find the optimal number of roach bait stations they should place in each of their buildings in order to minimize the number of cockroach complaints while also keeping expenditure on pest control affordable.

A subset of the company’s buildings have been randomly selected for an experiment:

  • At the beginning of each month, a pest inspector randomly places a number of bait stations throughout the building, without knowledge of the current cockroach levels in the building.
  • At the end of the month, the manager records the total number of cockroach complaints in that building.
  • The manager would like to determine the optimal number of traps (\(\textrm{traps}\)) that balances the lost revenue (\(R\)) that complaints (\(\textrm{complaints}\)) generate with the all-in cost of maintaining the traps (\(\textrm{TC}\)).

Fortunately, Bayesian data analysis provides a coherent framework for us to tackle this problem.

Formally, we are interested in finding

\[ \arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[R(\textrm{complaints}(\textrm{traps})) - \textrm{TC}(\textrm{traps})] \]

The property manager would also, if possible, like to learn how these results generalize to buildings they haven’t treated so they can understand the potential costs of pest control at buildings they are acquiring as well as for the rest of their building portfolio.

As the property manager has complete control over the number of traps set, the random variable contributing to this expectation is the number of complaints given the number of traps. We will model the number of complaints as a function of the number of traps.
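To make this objective concrete, the calculation can be sketched in R. Everything below is hypothetical: the lost revenue per complaint, the trap cost, and the predictive draws (simulated here from made-up Poisson rates) are placeholders for quantities we will estimate later in the analysis.

```r
# Hypothetical cost assumptions (placeholders, not client figures)
rev_loss_per_complaint <- 100  # lost revenue R per complaint
cost_per_trap <- 75            # all-in cost TC per trap

# Placeholder posterior predictive draws: 1000 draws of complaints
# for each candidate number of traps (rows = draws, cols = trap counts)
set.seed(42)
traps_grid <- 0:20
pred_complaints <- sapply(traps_grid,
                          function(k) rpois(1000, exp(log(4) - 0.25 * k)))

# Expected loss for each candidate number of traps; the optimum minimizes it
expected_loss <- rev_loss_per_complaint * colMeans(pred_complaints) +
  cost_per_trap * traps_grid
best_traps <- traps_grid[which.min(expected_loss)]
best_traps
```

In the real analysis the matrix of predictive draws would come from the fitted Stan model rather than from rpois().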

The data

The data provided to us is in a file called pest_data.RDS. Let’s load the data and see what the structure is:

pest_data <- readRDS('data/pest_data.RDS')
str(pest_data)
'data.frame':   120 obs. of  14 variables:
 $ mus                 : num  0.369 0.359 0.282 0.129 0.452 ...
 $ building_id         : int  37 37 37 37 37 37 37 37 37 37 ...
 $ wk_ind              : int  1 2 3 4 5 6 7 8 9 10 ...
 $ date                : Date, format: "2017-01-15" "2017-02-14" ...
 $ traps               : num  8 8 9 10 11 11 10 10 9 9 ...
 $ floors              : num  8 8 8 8 8 8 8 8 8 8 ...
 $ sq_footage_p_floor  : num  5149 5149 5149 5149 5149 ...
 $ live_in_super       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ monthly_average_rent: num  3847 3847 3847 3847 3847 ...
 $ average_tenant_age  : num  53.9 53.9 53.9 53.9 53.9 ...
 $ age_of_building     : num  47 47 47 47 47 47 47 47 47 47 ...
 $ total_sq_foot       : num  41192 41192 41192 41192 41192 ...
 $ month               : num  1 2 3 4 5 6 7 8 9 10 ...
 $ complaints          : num  1 3 0 1 0 0 4 3 2 2 ...

We have access to the following fields:

  • complaints: the number of complaints for the building in that month
  • building_id: a unique building identifier
  • traps: the number of bait stations ("traps") in the building that month
  • date: the date at which the complaints are recorded
  • live_in_super: an indicator for whether the building has a live-in super
  • age_of_building: the age of the building
  • total_sq_foot: the total square footage of the building
  • sq_footage_p_floor: the square footage per floor
  • floors: the number of floors in the building
  • monthly_average_rent: the average monthly rent in the building
  • average_tenant_age: the average age of the building's tenants

First, let’s see how many buildings we have data for:

N_buildings <- length(unique(pest_data$building_id))
N_buildings
[1] 10

And make some plots of the raw data:

ggplot(pest_data, aes(x = complaints)) + 
  geom_bar()

ggplot(pest_data, aes(x = traps, y = complaints, color = live_in_super == TRUE)) + 
  geom_jitter()

ggplot(pest_data, aes(x = date, y = complaints, color = live_in_super == TRUE)) + 
  geom_line(aes(linetype = "Number of complaints")) + 
  geom_point(color = "black") + 
  geom_line(aes(y = traps, linetype = "Number of traps"), color = "black", size = 0.25) + 
  facet_wrap(~building_id, scales = "free", ncol = 2, labeller = label_both) + 
  scale_x_date(name = "Month", date_labels = "%b") + 
  scale_y_continuous(name = "", limits = range(pest_data$complaints)) + 
  scale_linetype_discrete(name = "") + 
  scale_color_discrete(name = "Live-in super")

The first question we’ll look at is just whether the number of complaints per building per month is associated with the number of bait stations per building per month, ignoring the temporal and across-building variation (we’ll come back to those sources of variation later in the document). That requires only two variables, \(\textrm{complaints}\) and \(\textrm{traps}\). How should we model the number of complaints?

Bayesian workflow

See slides

Modeling count data : Poisson distribution

We already know some rudimentary information about what we should expect. The number of complaints over a month should be a non-negative integer. The property manager tells us that it is possible but unlikely that the number of complaints in a given month is zero. Occasionally there are a very large number of complaints in a single month. A common way of modeling this sort of skewed, lower-bounded count data is as a Poisson random variable. One concern about modeling the outcome variable as Poisson is that the data may be over-dispersed, but we’ll start with the Poisson model and then check whether over-dispersion is a problem by comparing our model’s predictions to the data.
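The key property the Poisson imposes is that its mean equals its variance. A quick self-contained simulation illustrates this; the analogous comparison of the mean and variance of pest_data$complaints is a rough first check for over-dispersion in the raw data, though it ignores the predictors.

```r
# For Poisson draws, the sample mean and variance should be nearly equal
set.seed(99)
x <- rpois(1e5, lambda = 3)
c(mean = mean(x), variance = var(x))  # both close to 3
```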

Model

Given that we have chosen a Poisson regression, we define the likelihood to be the Poisson probability mass function for the number of complaints, with a log-mean that is a linear function of the number of bait stations placed in the building, denoted below as traps. This model assumes that the mean and variance of the outcome variable complaints (number of complaints) are the same. We’ll investigate whether this is a good assumption after we fit the model.

For building \(b = 1,\dots,10\) at time (month) \(t = 1,\dots,12\), we have

\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t})} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} \end{align*} \]

Let’s encode this probability model in a Stan program.

Writing our first Stan model

  • Write simple_poisson_regression.stan together
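For reference, one possible version of simple_poisson_regression.stan looks like the sketch below (the priors are illustrative choices and the program we write together may differ in details). Note the y_rep replications in the generated quantities block, which we rely on later for posterior predictive checking:

```stan
data {
  int<lower=1> N;
  vector[N] traps;            // number of bait stations per building-month
  int<lower=0> complaints[N]; // number of complaints per building-month
}
parameters {
  real alpha;
  real beta;
}
model {
  // illustrative weakly informative priors
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  complaints ~ poisson_log(alpha + beta * traps);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]);
}
```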

Making sure our code is right

Before we fit the model to real data, we should check that our model works well with simulated data. We’ll simulate data according to the model and then check that we can sufficiently recover the parameter values used in the simulation.

  • Write simple_poisson_regression_dgp.stan together

We can use the stan() function to compile and fit the model, but here we will do the compilation and fitting in two stages to demonstrate what is really happening under the hood.

First we will compile the Stan program (simple_poisson_regression_dgp.stan) that will generate the fake data.

comp_dgp_simple <- stan_model('stan_programs/simple_poisson_regression_dgp.stan')

Printing this object shows the Stan program:

print(comp_dgp_simple)
S4 class stanmodel 'simple_poisson_regression_dgp' coded as follows:
data {
  int<lower=1> N;
  real<lower=0> mean_traps;
}
model {
} 
generated quantities {
  int traps[N];
  int complaints[N];
  real alpha = normal_rng(log(4), .1);
  real beta = normal_rng(-0.25, .1);
  
  for (n in 1:N)  {
    traps[n] = poisson_rng(mean_traps);
    complaints[n] = poisson_log_rng(alpha + beta * traps[n]);
  }
} 

Now we can simulate the data by calling the sampling() function.

fitted_model_dgp <- sampling(
  comp_dgp_simple,
  data = list(N = nrow(pest_data), mean_traps = mean(pest_data$traps)),
  chains = 1,
  iter = 1,
  algorithm = 'Fixed_param',
  seed = 123
  )

SAMPLING FOR MODEL 'simple_poisson_regression_dgp' NOW (CHAIN 1).
Iteration: 1 / 1 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               5.5e-05 seconds (Sampling)
               5.5e-05 seconds (Total)
# see http://mc-stan.org/rstan/articles/stanfit_objects.html for various
# ways of extracting the contents of the stanfit object
samps_dgp <- rstan::extract(fitted_model_dgp)
str(samps_dgp)
List of 5
 $ traps     : num [1, 1:120] 7 5 8 11 9 6 5 6 8 9 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ iterations: NULL
  .. ..$           : NULL
 $ complaints: num [1, 1:120] 0 1 0 0 0 0 0 0 1 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ iterations: NULL
  .. ..$           : NULL
 $ alpha     : num [1(1d)] 1.29
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL
 $ beta      : num [1(1d)] -0.283
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL
 $ lp__      : num [1(1d)] 0
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL

Fit the model to the fake data:

In order to pass the fake data to our Stan program using RStan, we need to arrange the data into a named list. The names must match the names used in the data block of the Stan program.

stan_dat_fake <- list(
  N = nrow(pest_data), 
  traps = samps_dgp$traps[1, ], 
  complaints = samps_dgp$complaints[1, ]
)
str(stan_dat_fake)
List of 3
 $ N         : int 120
 $ traps     : num [1:120] 7 5 8 11 9 6 5 6 8 9 ...
 $ complaints: num [1:120] 0 1 0 0 0 0 0 0 1 0 ...

Now that we have the simulated data we fit the model to see if we can recover the alpha and beta parameters used in the simulation.

comp_model_P <- stan_model('stan_programs/simple_poisson_regression.stan')
fit_model_P <- sampling(comp_model_P, data = stan_dat_fake, seed = 123)

posterior_alpha_beta <- as.matrix(fit_model_P, pars = c('alpha','beta'))
head(posterior_alpha_beta)

Assess parameter recovery

true_alpha_beta <- c(samps_dgp$alpha, samps_dgp$beta)
mcmc_recover_hist(posterior_alpha_beta, true = true_alpha_beta)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We don’t do a great job recovering the parameters here simply because we’re simulating so few observations that the posterior uncertainty remains rather large, but it looks at least plausible (\(\alpha\) and \(\beta\) are contained within the histograms). If we did the simulation with many more observations the parameters would be estimated much more precisely.

We should also check if the y_rep datasets (in-sample predictions) that we coded in the generated quantities block are similar to the y (complaints) values we conditioned on when fitting the model. (The bayesplot package vignettes are a good resource on this topic.)

Here is a plot of the density estimate of the observed data compared to 200 of the y_rep datasets:

y_rep <- as.matrix(fit_model_P, pars = "y_rep")
ppc_dens_overlay(y = stan_dat_fake$complaints, yrep = y_rep[1:200, ])

In the plot above we have the kernel density estimate of the observed data (\(y\), thicker curve) and 200 simulated data sets (\(y_{rep}\), thin curves) from the posterior predictive distribution. If the model fits the data well, as it does here, there is little difference between the observed dataset and the simulated datasets.

Another plot we can make for count data is a rootogram. This is a plot of the expected counts (continuous line) vs the observed counts (blue histogram). Here the model fits well: the observed histogram matches the expected counts closely.

ppc_rootogram(stan_dat_fake$complaints, yrep = y_rep)

Fit with real data

To fit the model to the actual observed data we’ll first create a list to pass to Stan using the variables in the pest_data data frame:

stan_dat_simple <- list(
  N = nrow(pest_data), 
  complaints = pest_data$complaints,
  traps = pest_data$traps
)
str(stan_dat_simple)
List of 3
 $ N         : int 120
 $ complaints: num [1:120] 1 3 0 1 0 0 4 3 2 2 ...
 $ traps     : num [1:120] 8 8 9 10 11 11 10 10 9 9 ...

As we have already compiled the model, we can jump straight to sampling from it.

fit_P_real_data <-
  stan("stan_programs/simple_poisson_regression.stan", 
       data = stan_dat_simple)

SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 1).

Gradient evaluation took 1.2e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.097229 seconds (Warm-up)
               0.087068 seconds (Sampling)
               0.184297 seconds (Total)


SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 2).

Gradient evaluation took 1.1e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.093976 seconds (Warm-up)
               0.0836 seconds (Sampling)
               0.177576 seconds (Total)


SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 3).

Gradient evaluation took 1e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.101434 seconds (Warm-up)
               0.086243 seconds (Sampling)
               0.187677 seconds (Total)


SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 4).

Gradient evaluation took 1e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.096386 seconds (Warm-up)
               0.094369 seconds (Sampling)
               0.190755 seconds (Total)

and printing the parameters. What do these tell us?

print(fit_P_real_data, pars = c('alpha','beta'))
Inference for Stan model: simple_poisson_regression.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha  2.57    0.01 0.16  2.26  2.47  2.57  2.68  2.87   817    1
beta  -0.19    0.00 0.02 -0.24 -0.21 -0.19 -0.18 -0.14   823    1

Samples were drawn using NUTS(diag_e) at Tue Sep 11 09:12:05 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We can also plot the posterior distributions:

mcmc_hist(as.matrix(fit_P_real_data, pars = c('alpha','beta')))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mcmc_scatter(as.matrix(fit_P_real_data, pars = c('alpha','beta')), alpha = 0.2)

As we expected, it appears the number of bait stations set in a building is associated with the number of complaints about cockroaches that were made in the following month. However, we still need to consider how well the model fits.

Posterior predictive checking

y_rep <- as.matrix(fit_P_real_data, pars = "y_rep")
ppc_dens_overlay(y = stan_dat_simple$complaints, y_rep[1:200,])

Unlike when we fit the model to simulated data above, here the simulated datasets are not as dispersed as the observed data and don’t seem to capture the rate of zeros in the observed data. The Poisson model may not be sufficient for these data.

Let’s explore this further by looking directly at the proportion of zeros in the real data and predicted data.

prop_zero <- function(x) mean(x == 0)
ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The plot above shows the observed proportion of zeros (thick vertical line) and a histogram of the proportion of zeros in each of the simulated data sets. It is clear that the model does not capture this feature of the data well at all.

Next, let’s plot the standardised residuals of the observed vs predicted number of complaints.

mean_y_rep <- colMeans(y_rep)
std_resid <- (stan_dat_simple$complaints - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_resid) + hline_at(2) + hline_at(-2)

As you can see here, it looks as though we have more positive residuals than negative, which indicates that the model tends to underestimate the number of complaints that will be received.

The rootogram is another useful plot to compare the observed vs expected number of complaints. This is a plot of the expected counts (continuous line) vs the observed counts (blue histogram):

ppc_rootogram(stan_dat_simple$complaints, yrep = y_rep)

If the model were fitting well these would be relatively similar; however, in this figure we can see that the number of complaints is underestimated when there are few complaints, overestimated for medium numbers of complaints, and underestimated again when there are a large number of complaints.

We can also view how the predicted number of complaints varies with the number of traps. From this we can see that the model doesn’t seem to fully capture the data.

ppc_intervals(
  y = stan_dat_simple$complaints, 
  yrep = y_rep,
  x = stan_dat_simple$traps
) + 
  labs(x = "Number of traps", y = "Number of complaints")

Specifically, the model doesn’t capture the tails of the observed data very well.

Expanding the model: multiple predictors

The model relating complaints to bait stations alone is the simplest one we could build. We can expand the model, however, in a few ways that will be beneficial for our client. Moreover, the manager has told us that they expect there are a number of other reasons that one building might have more roach complaints than another.

Interpretability

Currently, our model’s mean parameter is a rate of complaints per 30 days, but we’re modeling a process that occurs over an area as well as over time. We have the square footage of each building, so if we add that information into the model, we can interpret our parameters as a rate of complaints per square foot per 30 days.

\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\textrm{sq_foot}_b\,\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t} )} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} \end{align*} \]

The term \(\text{sq_foot}\) is called an exposure term. If we take its log, we can move it into \(\eta_{b,t}\):

\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t} )} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} + \textrm{log_sq_foot}_b \end{align*} \]

A quick test shows us that there appears to be a relationship between the square footage of the building and the number of complaints received:

ggplot(pest_data, aes(x = log(total_sq_foot), y = log1p(complaints))) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

Using the property manager’s intuition, we add two extra pieces of information we know about each building to both the simulated and real data: the (log of the) square footage and whether or not there is a live-in super.

stan_dat_simple$log_sq_foot <- log(pest_data$total_sq_foot/1e4)
stan_dat_simple$live_in_super <- pest_data$live_in_super

Stan program for Poisson multiple regression

Now we need a new Stan model that uses multiple predictors.

  • Write multiple_poisson_regression.stan together
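For reference, one possible sketch of multiple_poisson_regression.stan, which adds the live-in super indicator and the log square footage exposure term (entering the linear predictor with a coefficient fixed to 1). The priors are again illustrative:

```stan
data {
  int<lower=1> N;
  vector<lower=0>[N] traps;
  vector<lower=0,upper=1>[N] live_in_super;
  vector[N] log_sq_foot;      // exposure term, coefficient fixed to 1
  int<lower=0> complaints[N];
}
parameters {
  real alpha;
  real beta;
  real beta_super;
}
model {
  // illustrative weakly informative priors
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  beta_super ~ normal(-0.5, 1);
  complaints ~ poisson_log(alpha + beta * traps
                           + beta_super * live_in_super + log_sq_foot);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]
                               + beta_super * live_in_super[n]
                               + log_sq_foot[n]);
}
```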

Simulate fake data with multiple predictors

comp_dgp_multiple <- stan_model('stan_programs/multiple_poisson_regression_dgp.stan')
fitted_model_dgp <-
  sampling(
  comp_dgp_multiple,
  data = list(N = nrow(pest_data)),
  chains = 1,
  cores = 1,
  iter = 1,
  algorithm = 'Fixed_param',
  seed = 123
  )

SAMPLING FOR MODEL 'multiple_poisson_regression_dgp' NOW (CHAIN 1).
Iteration: 1 / 1 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               6.2e-05 seconds (Sampling)
               6.2e-05 seconds (Total)
samps_dgp <- rstan::extract(fitted_model_dgp)

Now pop that simulated data into a list ready for Stan.

stan_dat_fake <- list(
  N = nrow(pest_data), 
  log_sq_foot = samps_dgp$log_sq_foot[1, ],
  live_in_super = samps_dgp$live_in_super[1, ],
  traps = samps_dgp$traps[1, ], 
  complaints = samps_dgp$complaints[1, ]
)

And then compile and fit the model we wrote for the multiple regression.

comp_model_P_mult <- stan_model('stan_programs/multiple_poisson_regression.stan')
fit_model_P_mult <- sampling(comp_model_P_mult, data = stan_dat_fake, chains = 4, cores = 4)

Then compare these parameters to the true parameters:

posterior_alpha_beta <- as.matrix(fit_model_P_mult, pars = c('alpha','beta','beta_super'))
true_alpha_beta <- c(samps_dgp$alpha,samps_dgp$beta,samps_dgp$beta_super)
mcmc_recover_hist(posterior_alpha_beta, true = true_alpha_beta)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We’ve recovered the parameters sufficiently well, so we’ve probably coded the Stan program correctly and we’re ready to fit the real data.

Fit the real data

Now let’s use the real data and explore the fit.

fit_model_P_mult_real <- sampling(comp_model_P_mult, data = stan_dat_simple)

SAMPLING FOR MODEL 'multiple_poisson_regression' NOW (CHAIN 1).

Gradient evaluation took 3.8e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.38 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.228842 seconds (Warm-up)
               0.223057 seconds (Sampling)
               0.451899 seconds (Total)


SAMPLING FOR MODEL 'multiple_poisson_regression' NOW (CHAIN 2).

Gradient evaluation took 1.5e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.15 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.229064 seconds (Warm-up)
               0.199943 seconds (Sampling)
               0.429007 seconds (Total)


SAMPLING FOR MODEL 'multiple_poisson_regression' NOW (CHAIN 3).

Gradient evaluation took 1.6e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.240913 seconds (Warm-up)
               0.216317 seconds (Sampling)
               0.45723 seconds (Total)


SAMPLING FOR MODEL 'multiple_poisson_regression' NOW (CHAIN 4).

Gradient evaluation took 1.5e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.15 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.252624 seconds (Warm-up)
               0.211165 seconds (Sampling)
               0.463789 seconds (Total)
y_rep <- as.matrix(fit_model_P_mult_real, pars = "y_rep")
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[1:200,])

Again it looks like we haven’t captured the smaller counts very well, nor have we captured the larger counts.

prop_zero <- function(x) mean(x == 0)
ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero", binwidth = 0.01)

We’re still severely underestimating the proportion of zeros in the data. Ideally this vertical line would fall somewhere within the histogram.

We can also plot uncertainty intervals for the predicted complaints for different numbers of traps.

ppc_intervals(
  y = stan_dat_simple$complaints, 
  yrep = y_rep,
  x = stan_dat_simple$traps
) + 
  labs(x = "Number of traps", y = "Number of complaints")

We can see that we’ve increased the tails a bit more at the larger numbers of traps but we still have some large observed numbers of complaints that the model would consider extremely unlikely events.

Modeling count data with the Negative Binomial

When we considered modeling the data using a Poisson, we saw that the model didn’t appear to fit the data as well as we would like. In particular the model underpredicted low and high numbers of complaints and overpredicted medium numbers of complaints. This is one indication of over-dispersion, where the variance is larger than the mean. A Poisson model doesn’t fit over-dispersed count data very well because the same parameter, \(\lambda\), controls both the expected counts and the variance of those counts. The natural alternative is the negative binomial model:

\[ \begin{align*} \text{complaints}_{b,t} & \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} & = \exp{(\eta_{b,t})} \\ \eta_{b,t} &= \alpha + \beta \, {\rm traps}_{b,t} + \beta_{\rm super} \, {\rm super}_{b} + \text{log_sq_foot}_{b} \end{align*} \]

In Stan the negative binomial mass function we’ll use is called \(\texttt{neg_binomial_2_log}(\text{ints} \, y, \text{reals} \, \eta, \text{reals} \, \phi)\). Like the poisson_log function, this negative binomial mass function is parameterized in terms of its log-mean, \(\eta\), but it also has a precision parameter \(\phi\) such that

\[ \mathbb{E}[y] \, = \lambda = \exp(\eta) \]

\[ \text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi. \]

As \(\phi\) gets larger the term \(\lambda^2 / \phi\) approaches zero and so the variance of the negative-binomial approaches \(\lambda\), i.e., the negative-binomial gets closer and closer to the Poisson.
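We can check this limiting behaviour numerically with R’s rnbinom(), which uses the same mean/dispersion parameterization (size plays the role of \(\phi\) and mu the role of \(\lambda\)):

```r
set.seed(1)
lambda <- 5
for (phi in c(1, 10, 1000)) {
  x <- rnbinom(1e5, size = phi, mu = lambda)
  # theoretical variance: lambda + lambda^2 / phi
  cat(sprintf("phi = %4g  sample var = %6.2f  theory = %6.2f\n",
              phi, var(x), lambda + lambda^2 / phi))
}
```

As phi grows, the sample variance shrinks toward lambda = 5, the Poisson limit.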

Stan program for negative-binomial regression

  • Write multiple_NB_regression.stan together
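One possible sketch of multiple_NB_regression.stan is below. Following a common convention, it samples inv_phi (the reciprocal of the precision \(\phi\)), so that a prior concentrated near zero pulls the model toward the Poisson special case; the priors shown are illustrative:

```stan
data {
  int<lower=1> N;
  vector<lower=0>[N] traps;
  vector<lower=0,upper=1>[N] live_in_super;
  vector[N] log_sq_foot;
  int<lower=0> complaints[N];
}
parameters {
  real alpha;
  real beta;
  real beta_super;
  real<lower=0> inv_phi;   // 1 / phi; near zero means close to Poisson
}
transformed parameters {
  real phi = inv(inv_phi);
}
model {
  // illustrative weakly informative priors
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  beta_super ~ normal(-0.5, 1);
  inv_phi ~ normal(0, 1);
  complaints ~ neg_binomial_2_log(alpha + beta * traps
                                  + beta_super * live_in_super
                                  + log_sq_foot, phi);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = neg_binomial_2_log_rng(alpha + beta * traps[n]
                                      + beta_super * live_in_super[n]
                                      + log_sq_foot[n], phi);
}
```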

Fake data fit: Multiple NB regression

comp_dgp_multiple_NB <- stan_model('stan_programs/multiple_NB_regression_dgp.stan')

We’re going to generate one draw from the fake data model so we can use the data to fit our model and compare the known values of the parameters to the posterior density of the parameters.

fitted_model_dgp_NB <-
  sampling(
  comp_dgp_multiple_NB,
  data = list(N = nrow(pest_data)),
  chains = 1,
  cores = 1,
  iter = 1,
  algorithm = 'Fixed_param',
  seed = 123
  )

SAMPLING FOR MODEL 'multiple_NB_regression_dgp' NOW (CHAIN 1).
Iteration: 1 / 1 [100%]  (Sampling)

 Elapsed Time: 0 seconds (Warm-up)
               9e-05 seconds (Sampling)
               9e-05 seconds (Total)
samps_dgp_NB <- rstan::extract(fitted_model_dgp_NB)

Create a dataset to feed into the Stan model.

stan_dat_fake_NB <- list(
  N = nrow(pest_data), 
  log_sq_foot = samps_dgp_NB$log_sq_foot[1, ],
  live_in_super = samps_dgp_NB$live_in_super[1, ],
  traps = samps_dgp_NB$traps[1, ], 
  complaints = samps_dgp_NB$complaints[1, ]
)

Compile the inferential model.

comp_model_NB <- stan_model('stan_programs/multiple_NB_regression.stan')

Now we run our NB regression over the fake data and extract the samples to examine posterior predictive checks and to check whether we’ve sufficiently recovered our known parameters: alpha, beta, beta_super, and inv_phi.

fitted_model_NB <- sampling(comp_model_NB, data = stan_dat_fake_NB, 
                            chains = 4, cores = 4)
posterior_alpha_beta_NB <- 
  as.matrix(fitted_model_NB,
            pars = c('alpha',
                     'beta',
                     'beta_super',
                     'inv_phi')
  )

Construct the vector of true values from your simulated dataset and compare to the recovered parameters.

true_alpha_beta_NB <- 
  c(samps_dgp_NB$alpha,
    samps_dgp_NB$beta,
    samps_dgp_NB$beta_super,
    samps_dgp_NB$inv_phi
  )
mcmc_recover_hist(posterior_alpha_beta_NB, true = true_alpha_beta_NB)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Fit to real data and check the fit

fitted_model_NB <- sampling(comp_model_NB, data = stan_dat_simple)

SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 1).

Gradient evaluation took 0.000135 seconds
1000 transitions using 10 leapfrog steps per transition would take 1.35 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 1.25161 seconds (Warm-up)
               1.21852 seconds (Sampling)
               2.47013 seconds (Total)


SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 2).

Gradient evaluation took 5.4e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.54 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 1.22864 seconds (Warm-up)
               1.08005 seconds (Sampling)
               2.3087 seconds (Total)


SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 3).

Gradient evaluation took 5.8e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.58 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 1.2827 seconds (Warm-up)
               1.16555 seconds (Sampling)
               2.44825 seconds (Total)


SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 4).

Gradient evaluation took 5e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.5 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 1.15397 seconds (Warm-up)
               1.03894 seconds (Sampling)
               2.19291 seconds (Total)
samps_NB <- rstan::extract(fitted_model_NB)

Let’s look at our predictions vs. the data.

y_rep <- samps_NB$y_rep
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[1:200,])

It appears that our model now captures both the small counts and the tail of the data better than before.

Let’s check if the negative binomial model does a better job capturing the number of zeros:

ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
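Note that "prop_zero" is not a statistic built into bayesplot; ppc_stat looks it up as a function in the environment. As a reminder, a minimal definition (matching how the statistic was used earlier in this case study) is:

```r
# Proportion of zeros in a vector of counts
prop_zero <- function(x) mean(x == 0)

prop_zero(c(0, 0, 1, 3, 0, 2))  # 0.5
```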

These look OK, but let’s look at the standardized residual plot.

mean_inv_phi <- mean(samps_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_resid <- (stan_dat_simple$complaints - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_resid) + hline_at(2) + hline_at(-2)

Looks OK, but we still have some very large standardized residuals. This might be because we are currently ignoring that the data are clustered by building, and that the propensity for roach problems may vary substantially across buildings.

ppc_rootogram(stan_dat_simple$complaints, yrep = y_rep)

The rootogram now looks much more plausible: the expected number of complaints matches the observed number of complaints much more closely. However, we still have some larger counts that appear to be outliers for the model.

Check predictions by number of traps:

ppc_intervals(
  y = stan_dat_simple$complaints, 
  yrep = y_rep,
  x = stan_dat_simple$traps
) + 
  labs(x = "Number of traps", y = "Number of complaints")

We haven’t used the fact that the data are clustered by building yet. A posterior predictive check might elucidate whether it would be a good idea to add the building information into the model.

ppc_stat_grouped(
  y = stan_dat_simple$complaints, 
  yrep = y_rep, 
  group = pest_data$building_id, 
  stat = 'mean',
  binwidth = 0.2
)

We’re getting plausible predictions for most building means but some are estimated better than others and some have larger uncertainties than we might expect. If we explicitly model the variation across buildings we may be able to get much better estimates.